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Parallel D2-Clustering: Large-Scale Clustering of 

Discrete Distributions 

Yu Zhang, James Z. Wang, Senior Member, IEEE, and Jia Li, Senior Member, IEEE 

Abstract — The discrete distribution clustering algorithm, namely D2-clustering, has demonstrated its usefulness in image classification 
and annotation where each object is represented by a bag of weighed vectors. The high computational complexity of the algorithm, 
however, limits its applications to large-scale problems. We present a parallel D2-clustering algorithm with substantially improved 
scalability A hierarchical structure for parallel computing is devised to achieve a balance between the individual-node computation 
and the integration process of the algorithm. Additionally it is shown that even with a single CPU, the hierarchical structure results in 
significant speed-up. Experiments on real-world large-scale image data, Youtube video data, and protein sequence data demonstrate 
the efficiency and wide applicability of the parallel D2-clustering algorithm. The loss in clustering accuracy is minor in comparison with 
the original sequential algorithm. 

Index Terms — Discrete Distribution, Transportation Distance, Clustering, D2-Clustering, Parallel Computing, Large-Scale Learning, 
Image Annotation, Sequence Clustering. 
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1 Introduction 

THE transportation problem is about finding an op- 
timal matching between elements in two sets with 
different Radon measures under a certain cost function. 
The problem was initially formulated by French mathe- 
matician G. Monge in the late 18th century Il29l , and later 
by Russian mathematician L. Kantorovich in 1942 fT6). 
The solution of a transportation problem defines a metric 
between two distributions, which is normally known 
as the Kantorovich-Wasserstein metric. With the wide 
applications of the transportation problem in different 
fields, there are several variants, as well as names, of the 
corresponding transportation metric. Readers can refer 
to |3TI for these applications. 

The Kantorovich-Wasserstein metric is better known 
as the Mallows distance [261 ir* the statistics community 
because C. L. Mallows used this metric (with L2-norm 
as the cost function) to prove some asymptotic property. 
In the computer science community, the metric is often 
called the Earth Mover's Distance (EMD) |j32J. Hereafter 
we refer to the Kantorovich-Wasserstein metric as the 
Mallows distance. It is not until the recent decade that 
the Mallows distance has received much attention in 
data mining and machine learning. 

The advantages of the Mallows distance over a usual 
histogram-based distance include its effectiveness with 
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sparse representations and its consideration of cross- 
term relationships. We have found abundant applica- 
tions of the Mallows distance in video classification 
and retrieval ||34^, f38l, sequence categorization 1(21], and 
document retrieval [36], where objects are represented 
by discrete distributions. The Mallows distance is also 
adopted in a state-of-the-art real-time image annotation 
system named ALIPR ||22|, in which a clustering algo- 
rithm on discrete distributions was developed. 

Clustering is a major unsupervised learning methodol- 
ogy for data mining and machine learning. The K-means 
algorithm is one of the most widely used algorithms 
for clustering. To a large extent, its appeal comes from 
its simple yet generally accepted objective function for 
optimization. The original K-means proposed by Mac- 
Queen [251 only applies to the Euclidean space, which 
is a fundamental limitation. Though there are quite a 
few attempts to extend K-means to various distances 
other than the Euclidean distance QH), 113, 123, these K- 
means' variants generally work on vector data. In many 
emerging applications, however, the data to be processed 
are not vectors. In particular, bags of weighted vec- 
tors, which can be formulated as discrete distributions 
with finite but arbitrary support, are frequently used 
as descriptors for objects in information and image re- 
trieval ||5l. The discrete distribution clustering algorithm, 
also known as the D2-cliistering ||22| , tackles this type 
of data by optimizing an objective function defined in 
the same spirit as that used by K-means, specifically, to 
minimize the sum of distances between each object and 
its nearest centroid. 

Because linear programming is needed in D2- 
clustering and the number of unknown variables grows 
with the number of objects in a cluster, the computa- 
tional cost would normally increase polynomially with 
the sample size (assuming the number of clusters is 
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roughly fixed). For instance, ALIPR takes several min- 
utes to learn each semantic category by performing D2- 
clustering on 80 images in it. However, it will be difficult 
for ALIPR to include over 1 , 000 training images in one 
semantic category, because on a CPU with a clock speed 
of 2 GHz, the category may demand more than a day 
to cluster. There are some alternative ways fMl , |40ll to 
perform the clustering faster, but they generally perform 
some indirect or sub-optimal clustering instead of the 
optimization based D2-clustering. 

The clustering problems researchers tackle today are 
often of very large scale. For instance, we are seeking 
to perform learning and retrieval on images from the 
Web containing millions of pictures annotated by one 
common word. This leads to demanding clustering tasks 
even if we perform learning only on a fraction of the Web 
data. Adopting bags of weighted vectors as descriptors, 
we will encounter the same issue in other applications 
such as video, biological sequence, and text clustering, 
where the amount of data can be enormous. In such 
cases, D2-clustering is computationally impractical. 

Given the application potential of D2-clustering and 
its theoretical advantage, it is of great interest to develop 
a variant of the algorithm with a substantially lower 
computational complexity, especially under a parallel 
computing environment. With the increasing popularity 
of large-scale cloud computing in information systems, 
we are well motivated to exploit parallel computing. 

1.1 Preliminaries 

D2-clustering attempts to minimize the total squared 
Mallows distance between the data and the centroids. 
Although this optimization criterion is also used by K- 
means, the computation in D2-clustering is much more 
sophisticated due to the nature of the distance. The D2- 
clustering algorithm is similar to K-means in that it 
iterates the update of the partition and the update of 
the centroids. There are two stages within each iteration. 
First, the cluster label of each sample is assigned to be 
the label of its closest centroid; and secondly, centroids 
are updated by minimizing the within-cluster sum of 
squared distances. To update the partition, we need 
to compute the Mallows distance between a sample 
and each current centroid. To update the centroids, an 
embedded iterative procedure is needed which involves 
a large-scale linear programming problem in every it- 
eration. The number of unknowns in the optimization 
grows with the number of samples in the cluster. 

Suppose vi,V2, ■ ■ . , Vn' are discrete distribu- 
tions within a certain cluster currently and 

= {(v^\p^).---.iv^'\l^% * = l,2,...,n', 
the following optimization is pursued in order to 

update the centroid z = {{z^^\pi^^), . . . , (z'^^'^pi"* ')}. 
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E^^^" = l,---,n',a = l,...,U , 

r=l 

wi% > 0, i = 1, . . . , n', a = 1, . . . , r = 1, . . . , s' . 

If z is fixed, D{z, Vi) is in fact the Mallows distance 
between z and Vi. However in we cannot separately 
optimize D{z, Vi), i = 1, n' , because z is involved 
and the variables w'^r^L for different i's affect each other 
through z. To solve the complex optimization problem, 
the D2-clustering algorithm adopts the following itera- 
tive strategy. 

Algorithm 1. Centroid Update Process of D2-CIustering. 

Step 1 Fix z'^^^'s and update p"j\ r = 1, s' , and tu'-'l, « = 1, 
n' , r — I, s', Q = 1, ti. In this case the optimization ([l) 

of pi^^'s and will's reduces to a linear programming. 
Step 2 Then fix pi'^''s and wi'X's and update z'' ', r — 1, s'. 

In this case the solution of ^'''''s to optimize (ij are simply 

weighted averages: 

V"' V'' 1/)''^ 7)'°^ 



V^n' v^ti (i) 1,...,S 



(2) 



Step 3 Calculate the objective function e in {T) using the 
updated z'''''s, pi'^''s, and lUr^a's. Terminate the iteration until 
e converges. Otherwise go back to Step 1. 

The linear programming in Step 1 is the primary 
cost in the centroid update process. It solves pz , Wr'L 
r = 1, s', i = 1, n' , a = 1, U, a total of 



parameters when s' is set to 



where 



be the average of the t/s. The number of constraints 
is 1 + n' s' + X]"=i — 2r7,'.s'. Hence, the number of 
parameters in the linear programming grows linearly 
with the size of the cluster. Because it costs polynomial 
time to solve the linear programming problem [17J, D2- 
clustering is much more complex than K-means algo- 
rithm in computation. 

The analysis of the time complexity of K-means re- 
mains an unsolved problem because the number of 
iterations for convergence is difficult to determine. The- 
oretically, the worst case time complexity for K-means is 
17(2^/") [2]. Arthur et al. [1] show recently that K-means 
has polynomial smoothed complexity, which reflects the 
fact that K-means will converge in a relatively short time 
in real cases (although no tight upper or lower bound of 
time complexity has been proved). In practice, K-means 
usually converges fast, and seldom costs an exponential 
number of iterations. In many cases K-means converges 
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in linear or even sub-linear time IITOl . Although there 
is still a gap between the theoretical explanation and 
practice, we can consider K-means an algorithm with 
at most polynomial time. 

Although the D2-clustering algorithm interlaces the 
update of clustering labels and centroids in the same 
manner as K-means, the update of centroids is much 
more complicated. As described above, to update each 
centroid involves an iterative procedure where a large- 
scale linear programming problem detailed in (HJ has 
to be solved in every round. This makes D2-clustering 
usually polynomially more complex than K-means. 

The computational intensiveness of D2-clustering lim- 
its its usages to only small-scale problems. With emerg- 
ing demands to extend the algorithm to large-scale 
datasets, e.g., online image datasets, video resources, and 
biological databases, we exploit parallel processing in a 
cluster computing environment in order to overcome the 
inadequate scalability of D2-clustering. 

1.2 Overview of Parallel D2-Clustering 

To reduce the computational complexity of D2- 
clustering, we seek a new parallel algorithm. Paral- 
lelization is a widely adopted strategy for accelerat- 
ing algorithms. There are several recent algorithms |4|. 
1351 , 15711 that parallelize large-scale data processing. By 
distributing parallel processes to different computation 
units, the overall processing time can be greatly reduced. 

The labeling step of D2-clusterLng can be easily par- 
allelized because the cluster label of each point can 
be optimized separately. The centroid update stage de- 
scribed by which is the performance bottleneck of 
D2-clustering, makes the parallelization difficult. Unlike 
K-means' centroid update, which is essentially a step of 
computing an average, the computation of a centroid in 
D2-clustering is far more complicated than any linear 
combination of the data points. Instead, all the data 
points in the corresponding cluster will play a role in 
the linear programming problem in ([ij. 

The averaging step of K-means can be naturally paral- 
lelized. We divide the data into segments, and compute 
a local average of each segment in parallel. All the local 
averages can then be combined to a global average of 
the whole group of data. This is usually how K-means 
is being parallelized (9), flSl . To the best of our knowl- 
edge, there is no mathematical equivalence of a parallel 
algorithm for the linear programming problem ([ij in D2- 
clustering. However, we can adopt a similar strategy to 
parallelize the centroid update in D2-clustering: divid- 
ing the data into segments based on their adjacency, 
computing some local centroids for each segment in 
parallel, and combining the local centroids to a global 
centroid. Heuristically we can get a good estimation of 
the optimal centroid in this way, considering that the 
notion of centroid is exactly for a compact representation 
of a group of data. Hence, a centroid computed from the 
local centroids should represent well the entire cluster, 
although some loss of accuracy is expected. 



The most complex part in the parallelization of D2- 
clustering is the combination of local centroids. Same 
as the computation of local centroids, the combination 
operation also involves a linear programming with poly- 
nomial time in terms of data size. To achieve a low over- 
all runtime, we want to keep both linear programming 
problems at a small scale, which cannot be achieved 
simultaneously when the size of the overall data is large. 
As a result, we propose a new hierarchical structure to 
conduct parallel computing recursively through multiple 
levels. This way, we can control the scale of linear 
programming in every segment and at every level. 

We demonstrate the structure of the algorithm in 
Fig. [TJ which will be explained in detail in Section 12.11 
In short, we adopt a hierarchically structured parallel 
algorithm. The large dataset is first clustered into seg- 
ments, within each of which D2-clustering is performed 
in parallel. The centroids generated for each segment are 
passed to the next level of the hierarchy and are treated 
as data themselves. The data size in the new level can 
still be large, although it should be much smaller than 
the original data size. At the new level, the same process 
is repeated to cluster data into segments and perform 
D2-clustering in parallel for each segment. When the 
amount of data at a certain level is sufficiently small, no 
further division of data is needed and D2-clustering on 
all the data will produce the final result. We can think of 
the granularity of the data becoming coarser and coarser 
when the algorithm traverses through the levels of the 
hierarchy. Details about each step will be introduced in 
the following sections. 

Particularly for implementation, parallel computing 
techniques, such as Message Passing Interface (MPl) Il35l 
and MapReduce [8J can be applied. In the current work, 
MPl is used to implement the parallel algorithm. 

1 .3 Outline of the Paper 

The remainder of paper is organized as follows. The 
parallel D2-clustering algorithm and its performance 
analysis are presented in Section |2l Experimental results 
and comparison with other methods are described in 
Section |3l Finally, we conclude in Section HI 

2 Parallel D2-clustering Algorithm 

In this section, we present the parallel D2-clustering al- 
gorithm. Hereafter, we refer to the original D2-clustering 
as the sequential algorithm. We will describe the hier- 
archical structure for speeding up D2-clustering and a 
weighted version of D2-clustering which is needed in 
the parallel algorithm but not the sequential. We will also 
describe in detail the coordination among the individual 
computation nodes (commonly called the slave proces- 
sors) and the node that is responsible for integrating the 
results (called the master processor). Finally, we analyze 
the computational complexity of the parallel algorithm 
and explain various choices made in the algorithm in 
order to achieve fast speed. 
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Fig. 1. An illustration of the structure of the parallel D2- 
clustering algorithm. Crossings are data to be clustered at 
each level. Different shadings denote different segments 
to perform local D2-clustering, which can be done in par- 
allel within a level. Dots are cluster centroids. Centroids at 
a certain level become data points at the next level. The 
clustering stops going to a higher level until we get the 
desired number of clusters. 

2.1 Hierarchical Structure 

Fig. [l] illustrates the hierarchical structure used for par- 
allel processing. All input data are denoted by crossings 
at the bottom level (Level 1) in Fig. [l] Because of the 
large data size, these data points (each being a discrete 
distribution) are divided into several segments (reflected 
by different shadings in the figure) which are small 
enough to be clustered by the sequential D2-clustering. 
D2-clustering is performed within each segment. Data 
points in each segment are assigned with different clus- 
ter labels, and several cluster centroids (denoted by blue 
dots in the figure) are obtained. 

These cluster centroids are regarded as good summa- 
rization for the original data. Each centroid approximates 
the effect of the data assigned to the corresponding 
cluster, and is taken to the next level in the hierarchy. Ob- 
viously, the number of data points to be clustered at the 
second level is much reduced from the original data size 
although it can still be large for sequential D2-clustering. 
Consequently, the data may be further divided into small 
segments which sequential D2-clustering can handle. In 
our implementation, we keep the size of a segment at 
different levels roughly the same. When the algorithm 
proceeds to higher levels, the number of segments de- 
creases. The hierarchy terminates when the data at a 
level can be put in one segment. 

In short, at each level, except the bottom level, we 
perform D2-clustering on the centroids acquired from 
the previous level. Fig. [T] shows that the data size at 
the third level is small enough for not proceeding to 
another level. Borrowing the terminology of a tree graph, 
we can call a centroid a parent point and the data 
points assigned to its corresponding cluster child points. 
The cluster labels of the centroids at a higher level are 



inherited by their child points, and propagate further to 
all the descendent points. To help visualize the tree, in 
Fig. [TJ data points in one cluster at a certain level are 
connected to the corresponding centroid using dashed 
lines. And the red arrows between levels indicate the 
correspondence between a centroid and a data point at 
the next level. To determine the final clustering result for 
the original data, we only need to trace each data point 
through the tree and find the cluster label of its ancestor 
at the top level. 

In such a hierarchical structure, the number of cen- 
troids becomes smaller and smaller as the level increases. 
Therefore it is guaranteed that the hierarchy can termi- 
nate. Also since D2-clustering is conducted within small 
segments, the overall clustering can be completed fast. 
More detailed discussion about the complexity of the 
algorithm will be provided in a later section. 

2.2 Initial Data Segmentation 

We now describe the initial data segmentation method. 
In this step, we want to partition the dataset into groups 
all smaller than a certain size. The partitioning process 
is in a similar spirit as the initialization step of the LBG 
algorithm proposed by Linde et al. [24], an early instance 
of K-means in signal processing. The whole dataset is 
iteratively split into partitions until a stopping criterion 
is satisfied. And within each iteration, the splitting is an 
approximation of, and computationally less expensive 
than, an optimal clustering. 

Fig. |2] illustrates the iterative process containing a 
series of binary splitting. Different from the initialization 
of the LBG algorithm, which splits every segment into 
two parts, doubling the total number of segments in 
every iteration, our approach only splits one segment in 
each round. The original data are split into two groups 
using a binary clustering approach (to be introduced 
later). Then we further split the segments by the same 
process repetitively. Each time we split the segment 
containing the most data points into two clusters, so 
on so forth. The splitting process will stop when all the 
segments are smaller than a certain size. 

As shown by Fig. 2, in each roimd of dividing a group 
of data into two segments, we aim at good clustering so 
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that data points in each segment are close. On the other 
hand, this clustering step needs to be fast since it is a 
preparation before conducting D2-clustering in parallel. 
The method we use is a computationally reduced version 
of D2-clustering. It is suitable for clustering bags of 
weighted vectors. Additionally, it is fast because of some 
constraints imposed on the centroids. We refer to this 
version as the constrained D2-cliistering. 

The constrained D2-clustering is a reduced version of 
D2-clustering. We denote each data point to be clustered 
as V, = {{v'l^Kpi]^), . . . , ivf'\p[l'^)},i = 1, . . . ,n, and k 
initial centroids as Zj = . . . , iz''/'\pi''/^)}d = 

1, . . . , fc. Parameters ti and Sj are the numbers of support 
vectors in the corresponding discrete distributions Vi 
and Zj. And C{i) e denotes the cluster label 

assigned to Vi. The D2-clustering and constrained D2- 
clustering algorithms are described below. 

Algorithm 2. D2-CIustering and Constrained D2-Clustering. 

Step 1 Select some random centroids as the initialization. 
Step 2 Allocate a cluster label to each data point Vi by select- 
ing the nearest centroid. 

C(i) = argminj D{vi, Zj), 

where -D(ui, zj) is the Mallows distance between Vi and Zj. 

Step 3 (For D2-clustering only) Solve ((l} to update Zj, the 
centroid of each cluster, using the data points {vi.c(i)=j} 
assigned to the cluster. Algorithm 1, which is also an iterative 
process, is invoked here to solve Zj. 

Step 3* (For constrained D2-clustering only) Keep pi^' = 
1/sj, r = 1, . . . , Sj fixed when invoking Algorithm 1 to solve 
Zj. The linear programming can thus be separated into 
several smaller ones and runs much faster. The result of 
will be a uniform weighted bag of vectors containing support 
vectors at optimized locations. 

Step 4 Suppose the objective function in converges to ej 
after updating Zj. Calculate the overall objective function for 
the clustering: 

If e(C) converges, the clustering finishes; 
otherwise go back to Step 2. 



The difference between D2-clustering and constrained 
D2-clustering is the way to update centroids in Step 3 
of the above algorithm. As discussed previously in 
Section 11.11 we need to solve a large-scale linear pro- 
gramming problem in Algorithm 1 to update each Zj 
in D2-clustering. The linear programming involves the 
optimization of p^zj and w^r^a, r — \, . . . ^ Sj, a ~ \, . . . ^ti, 
i : C{i) = j {'Wr% is the matching weight between the 
r-th vector of Zj and the a-th vector of Vi). This makes 
D2-clustering computationally complex. 

In the constrained D2-clustering algorithm, we replace 
Step 3 by Step 3* in the D2-clustering algorithm. We 
simplify the optimization of D2-clustering by assuming 
uniform p'f^^ ~ ^1 Sji f ~ I7 • • • i Sji for any j = 1, . . . , fc. 
With this simplification, in the linear programming prob- 
lem in Step 1 of Algorithm 1, we only need to solve 
Wr,a's. Moreover, Wr,a's can be optimized separately for 
different i's, which significantly reduces the number of 



parameters in a single linear programming problem. 
More specifically, instead of solving one linear program- 
ming problem with a large number of parameters, we 
solve many linear programming problems with a small 
number of parameters. Due to the usual polynomial 
complexity of the linear programming problems, this can 
be done much faster than the original D2-clustering. 

2.3 Weighted D2-clustering 

Data segments generated by the initialization step are 
distributed to different processors in the parallel algo- 
rithm. Within each processor, a D2-clustering is per- 
formed to cluster the segment of data. Such a process 
is done at different levels as illustrated in Fig. [T] 

Clustering at a certain level usually produces unbal- 
anced clusters containing possibly quite different num- 
bers of data points. If we intend to keep equal contribu- 
tion from each original data point, the cluster centroids 
passed to a higher level in the hierarchy should be 
weighted and the clustering method should take those 
weights into account. We thus extended D2-clustering to 
a weighted version. As will be shown next, the extension 
is straightforward and results in little extra computation. 

It is obvious that with weighted data, the step of 
nearest neighbor assignment of centroids to data points 
will not be affected, while the update of the centroids 
under a given partition of data will. Let us inspect 
the optimization in ^ for updating the centroid z of 
one cluster. When data Vi have weights uji, a natural 
extension of the objective function is 

n' 

min y2uJiD'^{z,Vi) (3) 

z('-),pl''':l<r<s' 

where D{z,Vi) is defined in |(T). 

In Step 1 of Algorithm 1, we fix Zj^\ r = l,...,Sj, 
and solve a linear programming problem with objective 
function (|3]l under the same constraints specified in |l]l 
to get the vector weights p^l}'s and a matching uji% 
between Zj and each Vi, i : C{i) = j. Then in Step 2, 
the update for z^p is revised to a weighted version: 

Because the optimal solution of centroid is not affected 
when the weights are scaled by a common factor, we 
simply use the number of original data points assigned 
to each centroid as its weight. The weights can be 
computed recursively when the algorithm goes through 
levels in the hierarchy. At the bottom level, each original 
data point is assigned with weight 1. The weight of a 
data point in a parent level is the sum of weights of its 
child data points. 

2.4 Algorithmic Description 

With details in several major aspects already explained 
in the previous sections, we now present the overall 
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Fig. 3. Worl< flow and communication among master and 
slave processors in parallel D2-clustering for one iteration. 
Within each iteration (level), there are three synchronized 
steps for the master and slave processors. In each step, 
one of the master or slave side is performing computation 
(marl<ed in solid line) and the other is idle and waiting for 
incoming data and message (marl<ed in dashed line). 



slave processor will cluster tlie data it received by the 
weighted D2-clustering algorithm independently, and 
send the clustering result back to the master processor. 
As soon as the master processor receives all the within- 
segment clustering results from the slave processors, 
it will combine these results and set up the data for 
clustering at the next level if necessary. 

The current work adopts MPI for implementing the 
parallel program. In order to keep a correct work flow 
for both master and slave processors, we use synchro- 
nized communication ways in MPI to perform message 
passing and data transmission. A processor can broadcast 
its local data to all others, and send data to (or receive data 
from) a specific processor. Being synchronized, all these 
operations will not return until the data transmission 
is completed. Further, we can block the processors until 
all have reached a certain blocking point. In this way, 
different processors are enforced to be synchronized. 

Because the communication is synchronized, we can 
divide the overall work flow into several steps for both 
the master and slave processors. To better explain the 
algorithm, in particular how the master and slave pro- 
cessors interact, we describe the working processes of 
both the master and slaves together based on these steps, 
although they are essentially different sets of instructions 
performed at different CPU cores respectively. 

The work flow is illustrated in Fig. |3] There are one 
master processor (on the left of the figure) and M slave 
processors (on the right of the figure). The master pro- 
cessor will read the original set of discrete distributions 
X = {xi,x2, ■ ■ ■ ,xn} and maintain the clustering label 
C{i) G {!,..., fc'} for each data entry Xi, as well as 
the corresponding cluster centroids Z = {zi, Z2, Zfc'}. 
Initially the weight of each data entry is set to be Wi = 1, 
i = 1, . . . , N. At each level, there are three steps. In each 
step, either the master processor or the slave processors 
are idle, passively waiting and receiving incoming data, 
and the other one is performing computation (which 
are marked by dashed and solid icons in the figure 
respectively). Suppose the targeted number of clusters is 
k and the sequential (weighted) D2-clustering is applied 
to segments of data containing no more than r entries. 
The algorithm has a hierarchical structure and iteratively 
performs clustering level by level. And each iteration 
contains the following three steps. 

Algorithm 3. Parallel D2-Clustering. 



flow of the parallel D2-clustering algorithm. The parallel 
program runs on a computation cluster with multiple 
processors. In general, there is a "master" processor 
that controls the flow of the program and some other 
"slave" processors in charge of computation for the 
sub-problems. In the implementation of parallel D2- 
clustering, the master processor will perform the ini- 
tial data segmentation at each level in the hierarchi- 
cal structure specified in Fig. [H and then distribute 
the data segments to different slave processors. Each 



Step 1 Data segmentation in the master's side. 

• Tlie master processor segments the data objects G (con- 
taining n objects) at current level into m groups, Gi, 
Gm, each of which contains no more than r entries, using 
the constrained D2-clustering introduced in Section lZ2l The 
indices of data entries assigned to segment are stored in 
array L^. For each segment G^, the master processor sends 
the segment index /i, data in the segment and their weights 
to the slave processor with index [ji mod A/). 

• Each slave processor will listen to and receive any incoming 
data segment from the master processor in this step. It will 
store data in the segment = {v^i,l,v^la^ ■ ■ ■ i^m,",.} ^'^'^ 
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the corresponding weights = {cj^,i,a;p,2, . . . 
locally for clustering. In addition, the index of the cor- 
responding segment, /i, is also transmitted and kept in 
the slave processor. Since the number of data segments m 
might be larger than the number of slave processors A/, 
the master processor uses a modular operation to distribute 
different segments to different slave processors and each 
slave processor may handle multiple segments sequentially. 
The segment indices help the master processor manage the 
data distribution and identify the result of a certain segment 
when it is returned by a slave processor in the next step. 
Step 2 Parallel D2-clustering in the slaves' side. 

• The master processor listens to the clustering results, in- 
cluding both the labels (containing labels) and 
centroids (containing fc^ centroids), as well as the index 
fi, for each segment, and is blocked until it receives all 
segments' results. 

• Each slave processor will perform weighted D2-clustering 
on with weights VV,, by Algorithm 1 (using y) 
and (31) when updating centroids. Then a set of cen- 
troids Zfj, = {zpa, ■ ■ ■ ,2M,fej,} and cluster labels C^j — 
{C^(l), . . . , C^(np)} are acquired, where C^{i) = j if u^,; 
is assigned to centroid z^j. After this it sends back the 
segment index fi, the set of centroids and cluster labels 
Cf_i to the master processor. 

Step3 Work flow control in the master's side. 

• The master processor will merge the clustering results of 
all the segments and then determine whether to proceed to 
the next level. 

The current overall centroid list is set to Z — 9 at the 
beginning. The centroids Z^ for group are appended to 
Z one by one. And the labeling list of G, C{i),i = 1, ...,n, 
is updated once a new group of centroids is being added. 
Let the most recent list Z before adding Z^ be z'**"^^ 
(Z(°) = 0). C'(-) is computed by ©. 

C{L^{i)) = C^{i) + \Z'^^-'^\,i = 1, ...,n^ . (5) 

After all Z^ is merged to Z and all C{-) is computed, we 
update the cluster label of a data entry in the original 
dataset (bottom level), C{i), by inheriting label from its 
ancestor in G. 

G{i) = C{C{i)),i = l,...,N . (6) 

Assvmie fe^ is the number of clusters in segment G,j and 
A:' = fcp. At the end of the current level of clustering, 

we get a list of centroids, Z = {zi, Z2, Zk'}, and the 
cluster labels G(l), C(2), G{N) for the original data. 
If k' < k, clustering is accomplished. The master processor 
will output Z and C, and sends a termination signal to all 
the processors. 

Otherwise clustering at a higher level is needed. Let oj^ — 
LOi, i = 1, . . . ,n. Update G — Z and assign new weights to 
each entry in G by cj^ = X]Q C{a)=i "^a' * = 1, ■ ■ • , fe'- Then 
the master processor will go back to Step 1 and send signals 
to slave processors to tell them also to return to Step 1. 

• The slave processors are blocked in this step, doing nothing 
imtil the master processor tells them the next action, either 
to return to Step 1 and receive new data, or to terminate. 



2.5 Run-Time Complexity 

In this section, we analyze the time complexity of the 
parallel D2-clustering algorithm. 

As described in Section [TJ for D2-clustering, the com- 
putational complexity of updating the centroid for one 



cluster in one iteration is polynomial in the size of the 
cluster. The total amount of computation depends on 
computation within one iteration as well as the number 
of iterations. Again, as discussed in Section[TJ even for K- 
means clustering, the relationship between the number 
of iterations needed for convergence and the data size 
is complex. In practice, we often terminate the iteration 
after a pre-specified number of rounds (e.g., 500) if the 
algorithm has not converged before that. As a result, in 
the analysis of computational complexity, we focus on 
the time needed to complete one iteration and assume 
the time for the whole algorithm is a multiple of that, 
where the multiplication factor is a large constant. 

Suppose there are N data entries in total and we want 
to cluster them into k clusters. Assuming the clusters are 
of similar sizes, each cluster contains roughly data 
entries. The sequential D2-clustering algorithm will run 
in approximately O (fc(^)''') time, where 7 is the degree 
of the polynomial time required by linear programming 
for computing the centroid of one cluster. In many 
applications, when N increases, the number of clusters 
k may stay the same or increase little. As a result, the 
time for D2-clustering increases polynomially with N. 

Now consider the time required by the parallel D2- 
clustering. The sequential D2-clustering is then per- 
formed on segments of data with sizes no larger than 
T. Each segment is clustered into c groups with roughly 
equal size e. Thus, r = ce. After clustering every data 
segment, the number of data to be processed at the next 
level of the hierarchy shrinks from to since we use 
each cluster centroid to summarize data entries assigned 
to the corresponding cluster. 

Based on the rim-time analysis on the sequential clus- 
tering, the time for clustering one segment of data is 
O(ce^) = 0{Te''^^). The combined CPU time used for 
clustering all segments is then 0{Ne''^^). The actual time 
for performing clustering also depends on the number of 
available parallel CPU cores. Assuming there are M + 1 
CPUs, where M is the number of CPUs performing slave 
clustering operations and there is one CPU serving as the 
master. When AI > ^, it can be guaranteed that each 
segment is distributed to a unique CPU. Therefore the 
actual time used is 0{Te'^~^). When M < ^, there will 
be some CPUs handling more than one clustering jobs 
sequentially, so the actual time is O {yI^^^^)- 

In addition to clustering the segments, it also takes 
time to divide data into segments by the approach de- 
scribed in Section 1221 The data segments are generated 
by a binary tree-structured clustering process. For the 
simplicity of analysis, we assume the tree is balanced. 
The time to split a node is linear to the number of data 
entries in the node; and the time for creating ^ segments 
is O (A^log ■^). Similarly, there is also some computation 
for initialization when clustering each segment, which 
takes O (rlog ^) time. When r is large, the initialization 
cannot be neglected for run- time analysis. In summary, 
the overall actual time for performing clustering at the 
first level in Fig. [T] is 
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(7) 



The hierarchical clustering may not reach the desired 
number of clusters at the first level, where — cluster 

' e 

centroids are acquired. If — > fc, the algorithm proceeds 
to the next level, where each cluster centroid will be 
treated as a data entry. If we keep the same values for 
parameters r, e, and M , the time for the second level 
clustering can be calculated simply using ||Z| with N 
replaced by More generally, we can obtain the time 
complexity for any /-th level, / > 2, by 



N N 
nO = — log-^ 



o7-l 



+ logf 



-N 



(8) 



min (Af, ^)e'-i j 

Obviously T{1) > eT{l + 1). The algorithm will termi- 
nate at level L when < k. Hence, the highest level 
L = [logj ^'j . And the total run-time is T(l)+. . .+T{L) < 
^T(l) < 2T(1). The last inequality comes from the fact 
e > 2 (If e = 1, there is essentially no clustering). We thus 
see that the computational order of the total run-time is 
the same as that of T(l). 

With fixed N and r, the complexity T(l) in ^ in- 
dicates that a small value of e is favored. Another 
benefit of having a small e is that a centroid can better 
represent a small number of data entries. We set e = 5 
in our implementation. We do not go as low as e = 2 
because more levels will be needed which makes the 
coordination among processors harder to track. 

To achieve small T(l), Q shows that there is a trade- 
off for T. Given the number of slave CPUs, A/, which 
is inherently determined by the hardware, we try to 
choose a sufficiently large r so that the number of 
segments is no greater than M. This ensures that no 
CPU needs to process multiple segments sequentially. 
Moreover, large r will reduce the time for generating 
the segments, which corresponds to the first term in IZll. 
On the other hand, when r is large, assuming < M, 
the second term in l|71l is in the linearithmic order of t. 
Hence, r should not be unnecessarily large. If AI slave 



processors are guaranteed, r 



N_ 

M 



is a good choice. In 



our implementation, we observe that the sequential D2- 
clustering can handle 50 data entries and 10 clusters (5 
entries per cluster) at satisfactory speed. We thus set 
r = 50 and e = 5 in the parallel algorithm by default. 

Comparing the two terms in Ql, the first term dom- 
inates when N is large. Otherwise, the second term 
dominates because e"^^^ can be substantial. Assuming 
parameters M, r, and e are fixed, when ^ < M, T(l) will 
be dominated by a constant given in the second term. As 
N grows, T(l) will eventually be dominated by the first 
term, which is a linearithmic order (0(A^ log A^)). 

Based on the analysis above, we can see the speedup 
of the parallel D2-clustering over the original sequential 
algorithm, which runs in time O ( fl^-i ) ■ Usually N ^ k, 
so the sequential algorithm has to solve a large linear 
program when is large. In the parallel algorithm, 
during the update of a centroid, on average e data entries 



are handled, rather than ^. There is thus no large-scale 
linear programming. Even when there is only one CPU, 
i.e., M = 1, (that is, essentially no parallel processing), 
the "parallel" algorithm based on the hierarchy will 
have run-time O {^N {^eJ^^ + log ^)), which is linearith- 
mic order in N rather than polynomial order. If we have 
enough CPUs to guarantee M is always no smaller than 
^, the second term of ^ is always a constant with 
respect to N. When N is large, the first term O [N log ■^) 
dominates, which is much lower in order than O {-j^^-rr)- 

3 Experiments 

To validate the advantage and usefulness of this algo- 
rithm, we conduct experiments on four different types 
of dataset, including image, video, protein sequence, and 
synthetic datasets. Though our work was originally mo- 
tivated by image concept learning, we apply the parallel 
D2-clustering algorithm to different scenarios. Readers in 
these related areas may therefore find applications that 
are suitable to embed the algorithm into. 

First, we test the algorithm on images crawled from 
Flickr in several experiments. When crawling images 
from Flickr, we use the Flickr API to perform keyword 
query for certain concepts, and download top images 
with highest values of interestingness. The number of 
images is different in different experiments. Typically 
it is at a scale of thousands, much larger than the scale 
the sequential D2-clusterLng has been applied. If such 
clustering is performed for multiple concepts, the train- 
ing dataset for image tagging can easily reach several 
millions, which is quite large for image annotation. 

For such image datasets, each image is first seg- 
mented into two sets of regions based on the color 
(LUV components) and texture features (Daubechies-4 
wavelet [6J coefficients) respectively. We then use two 
bags of weighted vectors U and V , one for color and 
the other for texture, to describe an image as / = {U, V). 
The distance between two images, h = (Ui.Vi) and 
Ij = {Uj,Vj), is defined as 

D{h,I,) = iD^{U,,U,) + D^{V,,V,))-^ , (9) 

where D{-,-) is the Mallows distance. It is straightfor- 
ward to extend the D2-clustering algorithm to the case 
of multiple discrete distributions using the combined 
distance defined in ||9j. See [22] for details. The order 
of time complexity increases simply by a multiplicative 
factor equal to the number of distribution types, the so- 
called super-dimension. 

Second, we adopt videos queried and downloaded 
from Youtube. We represent each video by a bag of 
weighted vectors, and conduct parallel D2-clustering on 
these videos. Then we check the accordance between the 
clustering result and the videos' genre. 

For the video clustering, we adopt the features used 
in (39|. Edge Direction Histogram (EDH), Gabor (GBR), 
and Grid Color Moment (GCM) features are extracted 
from each frame. A video is segmented into several 
sub-clips based on the continuity of these three features 
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among nearby frames I23II . Using the time percentage of 
each sub-clip as the weight of its average feature vector, 
we represent the video by a bag of weighted vectors by 
combining all sub-clips. 

Third, we download the SwissProt protein data 
and apply clustering on a subset of this dataset. Prosite 
protein family data IITSl provides the class labels of these 
protein sequences. Using the Prosite data, we can select 
protein sequences from several certain classes as the ex- 
periment dataset, which contains data gathering around 
some patterns. Each protein sequence is composed of 20 
types of amino acids. We count the frequency of each 
amino acid in a sequence, and use the frequency as the 
signature of the sequence. 

At last, we synthesize some bags of weighted vectors 
following certain distribution and cluster them. 

The parallel program is deployed on a computation 
cluster at The Pennsylvania State University named "Cy- 
berStar" consisting of 512 quad-core CPUs and therefore 
2048 computation units. Since M = 2048, theoretically it 
can be guaranteed that at each level every data segment 
can be processed in parallel by a unique CPU core when 
the data size is several thousands. In practice the system 
will put a limit to the maximal number of processors 
each user can occupy at a time because it is a publicly 
shared server. 

In this section, we first evaluate objectively the D2- 
clustering algorithm on image datasets with different 
topics and scales. Then we show the D2-clustering 
results for the image, video, protein, and synthetic 
datasets respectively. 

3.1 Comparison with Sequential Clustering 

In order to compare the clustering results of the parallel 
and sequential D2-clustering, we have to run the exper- 
iment on relatively small datasets so that the sequential 
D2-clustering can finish in a reasonable time. In this 
experiment, 80 concepts of images, each including 100 
images downloaded from Flickr, are clustered individu- 
ally using both approaches. We compare the clustering 
results obtained for every concept. 

It is in general challenging to compare a pair of 
clustering results, as shown by some sincere efforts 
devoted to this problem IHTI . The difficulty comes from 
the need to match clusters in one result with those in 
another. There is usually no obvious way of matching. 
The matter is further complicated when the numbers 
of clusters are different. The parallel and sequential 
clustering algorithms have different initializations and 
work flows. Clearly we cannot match clusters based on a 
common starting point given by initialization. Moreover, 
the image signatures do not fall into highly distinct 
groups, causing large variation in clustering results. 

We compare the clustering results by three measures. 

First, we assess the quality of clustering by the gener- 
ally accepted criterion of achieving small within-cluster 
dispersion, specifically, the mean squared distance be- 
tween a data entry and its corresponding centroid. 



Second, we define a distance between two sets of cen- 
troids. Let the first set of centroids be Z = {zi, z^.} and 
the second set be Z' = {z[, z^, }. A centroid Zi (or z'^) is 
associated with a percentage pi (or p') computed as the 
proportion of data entries assigned to the cluster of (or 
z'j). We define a Mallows-type distance D{Z, Z') based on 
the element- wise distances D{zi,Zj), where D{zi,Zj) is 
the Mallows distance between two centroids, which are 
both discrete distributions. We call D{Z,Z') Mallows- 
type because it is also the square root of a weighted sum 
of squared distances between the elements in Z and Z'. 
The weights satisfy the same set of constraints as those 
in the optimization problem for computing the Mallows 
distance. 

k k' 

D\Z, Z') = 5]5]w;.,,i?'(z„z;), 

i=l 3 = 1 

k' k (10) 

subject to: ^ Wij = Pi , ^ mj = p'j , 

Wi,j > 0, i = 1, . . . , fc, j = 1, . . . , fc' . 

We refer to D{Z, Z') as the MM distance (Mallows-type 
weighted sum of the squared Mallows distances). 

Third, we use a measure for two ways of partition- 
ing a dataset, which is proposed in [|41||. This measure 
only depends on the grouping of data entries and is 
irrelevant with centroids. We refer to it as the categorical 
clustering distance. Let the first set of clusters be "P = 
{Pi,P2,...,-Pfc} and the second be V = P^, P^,}. 
Then the element- wise distance Dp{Pi,Pj) is the num- 
ber of data entries that belong to either Pj or Pj, 
but not the other. The distance between the parti- 
tions Dp{V,V') = E^=lJ2Ll^^,JDp{P^,P') is again a 



Mallows-type weighted sum of Dp{Pi,P'), 



1, . . . , fc. 



j = 1, . . . , fc'. See gl] for details. 

For each concept category, the number of clusters is 
set to 10. The mean squared distance from an image 
signature to its closest centroid is computed based on the 
clustering result obtained by the parallel or sequential 
D2-clustering algorithm. These mean squared distances 
are plotted in Fig. 4(a)[ We also compute the mean 
squared distance using randomly generated clusters for 
each concept. The random clustering is conducted 10 
times and the average of the mean squared distances 
is recorded as a baseline. The random clusters, un- 
surprisingly, have the largest mean squared distances. 
In comparison, the parallel and sequential algorithms 
obtain close values for the mean squared distances for 
most concepts. Typically the mean squared distance by 
the parallel clustering is slightly larger than that by the 
sequential clustering. This shows that the parallel clus- 
tering algorithm compromises the tightness of clusters 
slightly for speed. 

Fig. |4(b)| compares the clustering results using the 
MM distance between sets of centroids acquired by 
different methods. For each concept i, the parallel al- 
gorithm generates a set of centroids Zi = {zi, . . . , z^^} 
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(a) (b) (c) 

Fig. 4. Comparison between parallel and sequential clustering, (a) Mean squared distances from samples to 
corresponding centroids. The clustering results acquired by parallel and sequential D2-clustering algorithm on 80 
sets are similar in terms of mean squared distances to centroids. (b) Comparison of the squared MM distances fTOt 
between different sets of clustering centroids. Clustering centroids derived from different approaches are compared, 
(c) The categorical clustering distances L41J between results by different clustering approaches. 



and the sequential algorithm generates a set of cen- 
troids Z[ — {z-, . . . , z^, }. The solid curve in Fig. |4(b)| 

plots D'^{Zi,Zl) for each concept i. To demonstrate that 
D{Zi,Z[) is relatively small, we also compute I){Zi,Zj) 
for all pairs of concepts i j- Let the average df = 
■^Yfj=i,j^iD'^{Zi,Z'^), where c = 80 is the number 
of concepts. We use as a baseline for comparison 
and plot it by the dashed line in the figure. For all 
the concepts, di is substantially larger than D{Zi,Z[), 
which indicates that the set of centroids Zi derived from 
the parallel clustering is relatively close to Z[ from the 
sequential clustering. Another baseline for comparison 
is formed using random partitions. For each concept 
i, we create 10 sets of random clusters, and compute 
the average over the squared MM distances between 
Zi and every randomly generated clustering. Denote 
the average by df, shown by the dashed dot line in 
the figure. Again comparing with df , the MM distances 
between Zi and Z[ are relatively small for all concepts i. 



Fig. |4(c)| p lots the categorical clustering distance de- 
fined in f41] between the parallel and sequential cluster- 
ing results for every image concept. Again, we compute 
the categorical clustering distance between the result 
from the parallel clustering and each of 10 random par- 
titions. The average distance is shown by the dash line 
in the figure. For most concepts, the clustering results 
from the two algorithms are closer than those from 
the parallel algorithm and random partition. However, 
for many concepts, the categorical clustering distance 
indicates substantial difference between the results of the 
parallel and sequential algorithm. This may be caused to 
a large extent by the lack of distinct groups of images 
within a concept. 

In summary, based on all the three measures, the clus- 
tering results by the parallel and sequential algorithms 
are relatively close. 



3.2 Scalability Property 

In Section 12.51 we derive that the parallel D2-clustering 
runs in approximately linearithmic time, while the se- 
quential algorithm scales up poorly due to its polyno- 
mial time complexity. In order to demonstrate this, we 
perform clustering on sets of images with different sizes 
using both the parallel and sequential D2-clustering with 
different conditions, and examine the time consumed. 

Fig. |5] shows the running time on sets of images in the 
concept "mountain". In the plot, both axes are in loga- 
rithmic scales. All versions of clusterings are performed 
on datasets with sizes 50, 100, 200, 400, 800, 1600, 3200, 
and 6400. We test the parallel D2-clustering algorithm in 
three cases with identical parameters r = 50 and e = 5. 
In the first case, there is only one slave CPU handling all 
the clustering requests sent by the master CPU (i.e. there 
are two CPUs employed in total). All clustering requests 
are therefore processed sequentially by the only slave 
processor. In the second case, there are 16 slave CPUs (i.e. 
17 CPUs in total). In the third case, the conditions are the 
same to the second case, but the data segmentation is im- 
plemented by a Vector Quantization (VQ) [12] approach 
instead of the approach in Section 12.21 For comparison, 
the original sequential D2-clustering algorithm is also 
tested on the same datasets. 

Fig. |5] shows the parallel algorithm scales up approxi- 
mately at linearithmic rate. On a log-log plot, this means 
the slope of the curve is close to 1. The number of 
slave CPUs for the parallel processing, as discussed in 
Section 12.51 contributes a linear difference (the second 
term in (0) to the run-time between the first and second 
parallel cases. When N is small, the linear difference 
may be relatively significant and make the single CPU 
case much slower than the multiple CPU case. But the 
difference becomes less so when the dataset is large 
because there is a dominant linearithmic term in the 
run-time expression. Nevertheless, in either case, the 
parallel algorithm is much faster than the sequential 
algorithm. The slope of the curve on the log-log plot for 
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Fig. 5. Running time of clustering on different data 
size. Botin axes are in logaritlnmic scales. Both sequential 
and parallel D2-clustering algorithms are tested. For the 
parallel algorithm there are three cases, depending on the 
number of slave CPUs, and whether Vector Quantization 
(VQ) is adopted as the segmentation approach. 

the sequential algorithm is much larger than 1, indicating 
at least polynomial complexity with a high degree. As 
a matter of fact, the parallel algorithm with 16 slave 
CPUs takes 88 minutes to complete clustering on the 
dataset containing 6400 images, in contrast to almost 8 
hours consumed by the sequential algorithm running 
only on the dataset of size 200. Because the time for 
the sequential D2-clustering grows dramatically with the 
data size, we can hardly test it on dataset larger than 200. 

For the last parallel case, we include VQ in the seg- 
mentation in order to further accelerate the algorithm. 
Based on the analysis in Section IZH we know that data 
segmentation takes a lot of time at each level of the 
algorithm. In the VQ approach, we quantify the bag 
of weighted vectors to histograms and treat them as 
vectors. These histograms are then segmented by K- 
means algorithm in the segmentation step. The cluster- 
ing within each segment is still D2-clustering on the 
original data. However the time spent for segmentation 
is much shorter than the original approach in Section IZ21 
In Fig. |5] the acceleration is reflected by the difference 
between the run-time curves of parallel D2-clustering 
with and without VQ segmentation when N is large. 
When iV is small and the data segmentation is not 
dominant in the run-time, VQ will usually not make 
the clustering faster. In fact due to bad segmentation 
of such a coarse algorithm, the time consumed for D2- 
clustering within each segment might be longer. That 
is the reason why the parallel D2-clustering with VQ 
segmentation is slower than its counterpart without VQ 
segmentation (both have same parameters and 16 slave 
CPUs employed) in Fig. |5] when TV is smaller than 800. 

Theoretically VQ can reduce the order of the clustering 
from linearithmic to linear (referred by llZ|l). However 
because the quantization step loses some information. 




Fig. 6. Clustering result for 1488 images with concept 
"mountain" downloaded from Flickr. It takes 873 seconds 
to generate 13 clusters by using 30 slave CPU cores. Top 
20 images from 7 largest clusters with size larger than 1 00 
are displayed. The relative distance between two clusters 
represents the distance between their centroids. 

the clustering result might be less accurate. This can be 
reflected by the MM distance (defined in i fTOt ) between 
the parallel D2-clustering with VQ segmentation and the 
sequential D2-clustering results on a dataset containing 
200 images, which is 19.57 on average for five runs. 
Compared to 18.41 as the average MM distance between 
the original parallel D2-clustering and sequential D2- 
clustering results, the VQ approach makes the parallel 
clustering result less similar to the sequential clustering 
result which is regarded as the standard. 

No matter whether VQ is adopted, the experiment 
shows the acceleration of parallel D2-clustering over 
the sequential algorithm, which verifies the run-time 
analysis in Section |231 Applying the algorithm, we can 
greatly increase the scale of discrete distribution cluster- 
ing problems. 

3.3 Visualization of Image Clustering 

Simulating the real image annotation application, the 
parallel D2-clustering algorithm is applied to a set of 
1488 images downloaded from Flickr (r = 50, e = 5). 
These images are retrieved by query keyword "moun- 
tain". For such a set, we do not have ground truth for the 
clustering result, and the sequential D2-clustering cannot 
be compared because of its unrealistic running time. We 
thus visualize directly the clustering result and attempt 
for subjective assessment. 

The 1488 images are clustered into 13 groups by the 
parallel D2-clustering algorithm using 873 seconds (with 
30 slave CPUs). At the end, there are 7 clusters with 
sizes larger than 100, for each of which 20 images closest 
to the corresponding cluster centroid are displayed in 
groups in Fig. (6] In the figure we try to manually arrange 
similar clusters close, and dissimilar ones far away based 
on the Mallows distances between the cluster centroids. 
Fig. [6] shows that images with similar color and texture 
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are put in the same clusters. More clustering results for 
images with different concepts are show in Fig. [71 For 
all the datasets, visually similar images are grouped into 
the same clusters. The parallel algorithm has produced 
meaningful results. 

3.4 Video Clustering 

To demonstrate the application of parallel D2-clustering 
on video clustering, we download 450 videos from 
Youtube. They are all 5 to 20 minutes in length, and 
come from 6 different queries, which are "news", "soc- 
cer", "lecture", "movie trailer", "weather forecast", and 
"tablet computer review". Each category contains 75 
videos. We compare the clustering result with the cat- 
egory label to check the correctness of the algorithm. 

It should be emphasized that video classification is a 
complicated problem. Since our algorithm is an unsu- 
pervised clustering algorithm rather than a classification 
method, we cannot expect it to classify the videos with a 
high accuracy. In addition, though not the major concern 
in this paper, the visual features for segmenting and 
describing videos are crucial for the accuracy of the 
algorithm. Here we only adopt some easy-to-implement 
simple features for demonstration. Therefore the purpose 
of this experiment is not to pursue the best video classi- 
fication, but to demonstrate the reasonable results of the 
parallel D2-clustering algorithm on videos. The videos' 
class labels, which serve as the ground truth, are used 
as a reference to show whether similar videos can be 
clustered into a same group. 

In this experiment, we set t = 30, e = 5, and request 15 
slave CPUs (A/ = 15) to cluster the data. As mentioned 
above, we adopt three visual features, EDH, GBR, and 
GCD, to segment and describe each video. We extract 
the features and form the video descriptors before the 
clustering. Typically a video will be segmented into 10 
to 30 sub-clips, depending on its content and length. In 
addition, the feature vector itself is high in dimension 
(73 for EDH, 48 for GBR, and 225 for GCD). As a result, 
each video is represented by a large bag of weighted 
vectors. 



TABLE 1 

Video Clustering Result by Parallel D2-Clustering. 



Cluster 


CI 


C2 


C3 


C4 


C5 


C6 


(Size) 


(81) 


(36) 


(111) 


(40) 


(72) 


(75) 


Soccer 


43 


5 


1 


1 


8 


10 


Tablet 


19 


23 


14 


1 


9 


4 


News 


9 


3 


27 


6 


17 


5 


Weather 


4 





31 


32 


1 





Lecture 


4 


2 


32 





17 


13 


Trailer 


2 


3 


6 





20 


43 



By applying the parallel D2-clustering algorithm, we 
can get a clustering result for this high dimensional 
problem in 847 seconds. 6 major clusters (C1-C6) are 
generated and Table [T] is the confusion table. We then try 
to use the clustering result for classification by assigning 
each cluster with the label corresponding to its majority 



class. For these six clusters, the unsupervised clustering 
achieves a classification accuracy of 47.5%. 

3.5 Protein Sequence Clustering 

In bioinformatics, composition-based methods for se- 
quence clustering and classification (either DNA fT9ll , 
fl^ or protein [TTl) use the frequencies of different com- 
positions in each sequence as the signature. Nucleotides 
and amino acids are basic components of DNA and 
protein sequence respectively and these methods use a 
nucleotide or amino acid histogram to describe a se- 
quence. Because different nucleotide or amino acid pairs 
have different similarities determined by their molecular 
structures and evolutionary relationships, cross-term re- 
lations should be considered when we compute the dis- 
tance between two such histograms. As a result, the Mal- 
lows distance would be a proper metric in composition- 
based approaches. However, due to the lack of effective 
clustering algorithms under the Mallows distance, most 
existing clustering approaches either ignore the relation- 
ships between the components in the histograms and 
treat them as variables in a vector, or perform clustering 
on some feature vectors derived from the histograms. 
In this case, D2-clustering will be especially appealing. 
Considering the high dimensionality and large scale of 
biological data, parallel algorithm is necessary. 

In this experiment, we perform parallel D2-clustering 
on 1,500 protein sequences from Swiss-Prot database, 
which contains 519,348 proteins in total. The 1,500 pro- 
teins are selected based on Prosite data, which provides 
class labeling for part of Swiss-Prot database. We ran- 
domly choose three classes from Prosite data, and extract 
500 protein sequences from Swiss-Prot database for each 
class to build our experiment dataset. 

Each protein sequence is transformed to a histogram 
of amino acid frequencies. There is a slight modification 
in the computation of the Mallows distance between two 
such histograms over the 20 amino acids. In Equation |[T]|, 
the squared Euclidean distance between two support 
vectors is replaced by a pair-wise amino acid distance 
provided in the PAM250 mutation matrix 130 J . Given any 
two amino acids A and B, we can get the probabilities 
of A mutated to B and B mutated to A from the matrix. 
The distance between A and B is defined as 

DmM25o{A, B) = log {P{A\B)) + log {P{B\A)) . (11) 

Because the support vectors for each descriptor are the 
20 types of amino acids and hence symbolic, we will skip 
the step to update support vectors in the centroid update 
of D2-clustering (refer to Step 2 in Algorithm 1) in the 
implementation of the parallel D2-clustering algorithm 
in this case. We set r = 30, e = 5, and request 7 slave 
processors to perform the parallel clustering (7\/ = 7). 
We let the program generate 5 clusters and the clustering 
finishes in about 7 hours. Considering the high dimen- 
sionality of the histograms and the scale of the dataset, 
this is a reasonable time. Three out of the five clusters are 
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(a) Animals 



(b) China 





(c) CInurch 



(d) Halloween 



Fig. 7. Visualized clustering results on four image concepts. Each concept contains 1 ,000 images from Flickr. They are 
clustered by the parallel D2-clustering algorithm and top ranked images in large clusters are displayed. Visually similar 
images are grouped into same clusters, though images in a same cluster do not always stand for similar semantics. 



major ones, which contain more than 99% of the dataset. 
Fig. 8(a) describes the clustering centroids. 

If we do not consider cross-term differences in the 
distance between two histograms and use Euchdean 
distance as the metric, the clustering is reduced to K- 
means. Therefore in this application, we are able to 
apply K-means on the dataset. Running on the multi- 
CPU server, the K-means algorithm is also implemented 
by parallel programs. We parallelize K-means in two 
different ways. In the first one, we apply the approach 
in fTS], which is to perform centroid update and label 
update in parallel for different segments of data and 
then combine partial result linearly. In this way the 
parallelization is equivalent to the original K-means on 
the whole dataset. In the second way, we adopt the 
same hierarchical clustering structure, as in Fig. [TJ used 
in the parallel D2-clustering algorithm and only replace 
the Mallows distance by the Euclidean distance. Thus 
within each data segment at any level of the hierarchy, 
the locally performed clustering is K-means. However 
there will be some loss of accuracy since several data 
are grouped and treated as one object at higher levels' 
clustering, and will never be separated thereafter. For 
short, we refer to the first implementation as parallel 
K-means and the second one as hierarchical K-means. 



troids generated by parallel K-means and hierarchical 
K-means, running on the same dataset with identical 
parameters. Apparently, none of parallel K-means' or 
hierarchical K-means' centroids reveals distinct patterns 
compared with the centroids acquired by parallel D2- 
clustering. In order to prove the fact objectively, we 
compute the Davies-Bouldin index (DBI) |7J for all the 
three clustering results as the measurement of their 
tightness. DBI is defined as 



DB 



1 



max 
k ^ i^j 



d{zj,zi) 



(12) 



where k is the number of clusters, zj is the centroid of 
cluster j, d{zj, zi) is the distance from zj to zi, and (Tj is 
the average distance from zj to all the elements in cluster 
j. DBI is the average ratio of intra-cluster dispersion 
to inter-cluster dispersion. Lower DBI means a better 
clustering result with tighter clusters. 

TABLE 2 

Davies-Bouldin index for different clustering results. 



Distance used in <12t 


Parallel 
D2 


Parallel 
K-means 


Hierarchical 
K-means 


Squared Mallows 


1.55 


2.99 


2.01 


Squared L2 


1.69 


5.54 


2.04 



Fig. 8(b) and Fig. 8(c) show the clustering cen- We compute DBI using both the squared Mallows 
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50.80% 27.60% 21.47% 

(a) Parallel D2-clustering 
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(b) Parallel K-means 
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49.00% 


38.13% 
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(c) Hierarchical K-means 

Fig. 8. Cluster centroids computed by different ap- 
proaches in the protein clustering experiment. Each plot, 
corresponding to one centroid, shows a histogram over 
the 20 amino acids. The number below each histogram 
is the percentage of data points belonging to the corre- 
sponding cluster. 

distance and the squared Euclidean distance as d{-) 
in |[T2] | for each clustering result. Table |2] shows the 
result. Parallel D2-clustering gains the lowest DBI in 
both cases. This indicates good tightness of the clusters 
generated by parallel D2-clustering. In contrast, the two 
implementations of K-means generate looser clusters 
than parallel D2-clustering. Though both can complete 
very fast, their clustering results are less representative. 

3.6 Evaluation with Synthietic Data 

Except for all the above experiments on real data, we also 
apply the algorithm on a synthetic dataset. The S3mthetic 
data are bags of weighted vectors. We carefully sample 
the vectors and their weights so that the data gather 
around several centroids, and their distances to the 
corresponding centroids follow the Gamma distribution, 
which means they follow a Gaussian distribution in a 
hypothetical space [22]. By using this dataset, we have 
some ideally distributed data with class labels. 

We create a synthetic dataset containing 15 clusters, 
each composed of 100 samples around the centroid. The 
clusters are well separated. The upper bound for the 
number of clusters in the algorithm is set 20. It takes 
about 8 minutes to cluster the data (with 30 slave CPUs). 
Then 17 clusters are generated. We try to match these 
17 clusters to the ground truth clusters: 9 of the 17 
clusters are identical to their corresponding true clusters; 
5 clusters with sizes close to 100 differ from the true 
clusters only by a small number of samples; 2 clusters 
combine precisely into one true cluster; and at last a 
small cluster of size 8 may have formed from some 
outliers. It is evident that for this synthetic dataset, the 



parallel D2-clustering has produced a result close to the 
ground truth. 

4 Conclusion 

As a summary, a novel parallel D2-clustering algorithm 
with dynamic hierarchical structure is proposed. Such 
algorithm can efficiently perform clustering operations 
on data that are represented as bags of weighted vectors. 
Due to the introduction of parallel computing, the speed 
of the clustering is greatly enhanced, compared to the 
original sequential D2-clustering algorithm. By deploy- 
ing the parallel algorithm on multiple CPUs, the time 
complexity of D2-clustering can be improved from high- 
ordered polynomial to linearithmic time, with minimal 
approximation introduced. The parallel D2-clustering 
algorithm can be embedded into various applications, 
including image concept learning, video clustering, and 
sequence clustering problems. In the future, we plan to 
further develop image concept learning and annotation 
applications based on this algorithm. 
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